Skip to main content

Excel Processing Guide

This guide explains how the Excel processing system works, its components, and how data flows through the system from input to output.

Processing Pipeline

The Excel processing system follows a multi-stage pipeline:

  1. Input Processing
  2. Layout Analysis
  3. Feature Detection
  4. Data Extraction
  5. Post-Processing
  6. Output Generation

1. Input Processing

The system begins by reading and validating the Excel file:

Key components:

  • File validation ensures proper format and accessibility
  • Sheet detection matches configuration patterns
  • Header analysis identifies column structures
  • Initial data structure creation for processing

2. Layout Analysis

The system analyzes the sheet layout to understand its structure:

Table Detection

  • Identifies table boundaries
  • Determines header positions
  • Analyzes merged cells
  • Maps column relationships

Layout Types

  1. LAYOUT Type

    • Complex structure with multiple sections
    • Contains formatting-based features
    • Supports cluster detection
  2. OFFERS Type

    • Simple tabular structure
    • Direct field mapping
    • Standard data extraction

3. Feature Detection

Features are detected based on cell properties:

Cell Properties Analyzed

  • Font properties (color, style, size)
  • Border properties
  • Background colors
  • Cell formats
  • Content patterns

Feature Search Process

{
"FeatureSearch": {
"SEARCH_NAME": {
"TYPE": "OR_ARRAY",
"FEATURES": [
"FEATURE_PATTERN_1",
"FEATURE_PATTERN_2"
]
}
}
}

4. Data Extraction

Data extraction varies by sheet type:

LAYOUT Sheets

  1. Cluster Detection

    • Identifies logical groups of data
    • Applies container rules
    • Processes sub-clusters
  2. Feature Extraction

    • Extracts based on formatting
    • Applies pattern matching
    • Processes special cases

OFFERS Sheets

  1. Direct Mapping

    • Maps columns to fields
    • Applies transformations
    • Validates data types
  2. Data Validation

    • Checks required fields
    • Validates data formats
    • Handles missing data

5. Post-Processing

After extraction, data undergoes post-processing:

Field Processing

{
"FieldValueReprocessing": [
{
"FIELD": "FIELD_NAME",
"RULES": [
{
"VALUE": "OLD_VALUE",
"REPLACEMENT_VALUE": "NEW_VALUE",
"CONDITIONS": [
{
"FIELD_A": "=VALUE_A",
"FIELD_B": "~=PATTERN_B"
}
]
}
]
}
]
}

Data Enrichment

  • Applies field mappings
  • Adds computed fields
  • Implements business rules
  • Performs data validation

6. Output Generation

The system generates output in the specified format:

Output Options

  • CSV export with configurable delimiter
  • HTML feature output
  • Structured JSON data
  • Custom format exports

Core Components

ExcelReader

Responsibilities:

  • Read Excel file data
  • Validate file structure
  • Extract sheet information
  • Process workbook metadata

LayoutProcessor

Responsibilities:

  • Detect table structures
  • Process data clusters
  • Extract layout features
  • Handle complex layouts

FeatureProcessor

Responsibilities:

  • Search for features based on patterns
  • Validate feature matches
  • Extract feature data
  • Process feature relationships

Data Structures

Feature Matches

Key data points:

  • Pattern information
  • Cell location
  • Value data
  • Formatting information

Cluster Data

Organization:

  • Cluster identifier
  • Cell range information
  • Associated features
  • Sub-cluster relationships

Advanced Features

Custom Processing Rules

Example configuration:

{
"CustomRules": {
"RULE_NAME": {
"TYPE": "CUSTOM",
"PROCESSOR": "CustomProcessor",
"CONFIG": {
"param1": "value1",
"param2": "value2"
}
}
}
}

Data Validation

Configuration example:

{
"Validation": {
"FIELD_NAME": {
"TYPE": "PATTERN",
"PATTERN": "^[A-Z]{2}\\d{4}$",
"ERROR_MESSAGE": "Invalid format"
}
}
}

Best Practices

Performance Optimization

  1. Sheet Processing

    • Use specific sheet targets
    • Limit search ranges
    • Optimize pattern matching
  2. Memory Management

    • Process large files in chunks
    • Clean up temporary data
    • Use streaming when possible
  3. Error Handling

    • Implement proper validation
    • Log processing errors
    • Provide clear error messages

Configuration Management

  1. Pattern Organization

    • Group related patterns
    • Use meaningful names
    • Document complex patterns
  2. Feature Definition

    • Define clear search criteria
    • Use specific patterns
    • Document special cases

Troubleshooting

Common Issues

  1. Performance Problems

    • Check search patterns
    • Verify cluster configurations
    • Monitor memory usage
  2. Data Extraction Issues

    • Validate feature patterns
    • Check field mappings
    • Verify post-processing rules
  3. Output Problems

    • Check format settings
    • Verify field mappings
    • Validate data types

Debug Process

  1. Enable relevant logging:
{
"LOGGING_CONFIGS": {
"EXCEL_READER_LOGGING": "true",
"FEATURE_SEARCH_LOGGING": "true",
"PROCESSOR_RULE_LOGGING": "true"
}
}
  1. Check processing stages:

    • Verify input file
    • Check sheet detection
    • Validate feature matches
    • Review data extraction
  2. Review output:

    • Check data format
    • Verify field values
    • Validate transformations

Error Recovery

Processing Errors

Common error scenarios:

  • Missing features
  • Validation failures
  • Pattern matching errors
  • Data type mismatches

Data Recovery

Recovery strategies:

  1. Save processing state
  2. Enable restart capabilities
  3. Implement fallback options

Testing Guidelines

Component Testing

Test areas:

  • Feature detection accuracy
  • Data extraction reliability
  • Pattern matching precision
  • Error handling effectiveness

Integration Testing

Test workflow:

  1. Input processing
  2. Feature detection
  3. Data extraction
  4. Output generation

Development Guidelines

  1. Follow consistent naming conventions
  2. Document complex logic and algorithms
  3. Implement comprehensive error handling
  4. Write thorough tests
  5. Maintain clear configuration standards

This guide provides a comprehensive overview of the Excel processing system. For specific implementation details, refer to the API documentation and code examples.